fix: unify ColumnNotFound for duckdb and pyspark #2493

Conversation
Conversation
I think I can make some other clean-up of repetitive code. I'll try tomorrow morning

I made a follow-up PR #2495 with the cleanup :)
thanks for working on this! just got a comment on the `.columns` usage
narwhals/_duckdb/expr.py (Outdated)

```diff
@@ -186,7 +187,14 @@ def from_column_names(
         context: _FullContext,
     ) -> Self:
         def func(df: DuckDBLazyFrame) -> list[duckdb.Expression]:
-            return [col(name) for name in evaluate_column_names(df)]
+            col_names = evaluate_column_names(df)
+            missing_columns = [c for c in col_names if c not in df.columns]
```
`df.columns` comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation
I was hoping we could do something like we do for Polars. That is to say, when we do `select` / `with_columns`, we wrap them in `try`/`except`, and in the `except` block we intercept the error message to give a more useful / unified one
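A minimal sketch of that pattern, assuming duckdb raises `duckdb.BinderException` for unknown columns (the standalone helper name and form are illustrative, not the actual Narwhals code):

```python
import duckdb

from narwhals.exceptions import ColumnNotFoundError


def safe_select(rel: duckdb.DuckDBPyRelation, *names: str) -> duckdb.DuckDBPyRelation:
    """Project columns, translating duckdb's error into the unified one."""
    try:
        return rel.select(*names)
    except duckdb.BinderException as e:
        # Only pay the (possibly expensive) `.columns` cost on the
        # failure path, where an extra scan no longer matters.
        msg = f"{e}\n\nHint: Did you mean one of these columns: {rel.columns}?"
        raise ColumnNotFoundError(msg) from e
```

This keeps the happy path free of the `.columns` call discussed above.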
Ah interesting, I was not aware.

What is happening in the background in duckdb that causes this overhead? Do you have a link to the docs? (Just want to learn more)

Also, is it a specific caveat of duckdb? I don't think we should worry about that in spark-like, but I might be wrong.

I will update the code tonight anyway (but of course feel free to add commits to this branch if you need it for today's release)
> `df.columns` comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

@MarcoGorelli could we add that to (#805) and put more of a focus towards it?
I don't think it's documented, but evaluating `.columns` may sometimes require doing a full scan. Example:

```
In [48]: df = pl.DataFrame({'a': rng.integers(0, 10_000, 100_000_000), 'b': rng.integers(0, 10_000, 100_000_000)})

In [49]: rel = duckdb.table('df')
100% ██████████████████████████████████████████████████████████████

In [50]: rel1 = duckdb.sql("""pivot rel on a""")

In [51]: %timeit rel.columns
385 ns ± 7.62 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [52]: %timeit rel1.columns
585 μs ± 3.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
Granted, we don't have `pivot` in the Narwhals lazy API, but a pivot may appear in the history of the relation which someone passes to `nw.from_native`, and the output schema of `pivot` is value-dependent (💩)

The same consideration should apply to spark-like.
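To make the value-dependence concrete, here is a small illustrative example (made-up table, not from the thread):

```python
import duckdb

# PIVOT produces one output column per distinct value of the pivoted
# column, so the output schema cannot be known without reading the data.
duckdb.sql("CREATE TABLE t (category VARCHAR, amount INTEGER)")
duckdb.sql("INSERT INTO t VALUES ('x', 1), ('y', 2), ('y', 3)")
rel = duckdb.sql("PIVOT t ON category USING sum(amount)")
print(rel.columns)  # ['x', 'y']: the column names come from the values
```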
How do those timings compare to other operations/metadata lookups on the same tables?
`.alias` for example is completely non-value-dependent, so that stays fast:

```
In [60]: %timeit rel.alias
342 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [61]: %timeit rel1.alias
393 ns ± 2.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
narwhals/_spark_like/dataframe.py (Outdated)

```python
        try:
            return self._with_native(self.native.select(*new_columns_list))
        except AnalysisException as e:
            msg = f"Selected columns not found in the DataFrame.\n\nHint: Did you mean one of these columns: {self.columns}?"
```
Not 100% sure about this error message. I don't think we can access the missing column names at this level, am I missing something?
I think what you've written is great - even though we can't access them, we can still try to be helpful
I split the test into lazy and eager to simplify the if-else statements a bit. I hope it is a bit more readable?
tests/frame/select_test.py (Outdated)

```python
    return df


    if constructor_id == "polars[lazy]":
        msg = r"^e|\"(e|f)\""
```
Before it was `msg = "e|f"`. Now it is a bit stricter
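For illustration, roughly what the stricter pattern does and does not accept (the messages below are made up):

```python
import re

pattern = r"^e|\"(e|f)\""
assert re.search(pattern, "e not found")          # message starting with 'e'
assert re.search(pattern, 'cannot resolve "f"')   # quoted column name
# Matched by the old "e|f", but no longer by the stricter pattern:
assert not re.search(pattern, "columns e and f exist")
```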
```python
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(nw.col("fdfa"))
```
before this was not tested for polars
tests/frame/select_test.py (Outdated)

```python
    constructor_lazy: ConstructorLazy, request: pytest.FixtureRequest
) -> None:
    constructor_id = str(request.node.callspec.id)
    if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]")):
```
`sqlframe` and `pyspark[connect]` raise errors at collect. I need to double check `pyspark[connect]`. Currently cannot set it up locally... Working on it ⏳

Do you have an idea on how to deal with these? One option is sketched below.
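A hedged sketch of one way to handle it, mirroring the test above (fixture names are as in the diff; the exact exception type raised at collect-time is an assumption):

```python
import pytest

import narwhals as nw
from narwhals.exceptions import ColumnNotFoundError


def test_select_missing_column_lazy(constructor_lazy, request) -> None:
    constructor_id = str(request.node.callspec.id)
    df = nw.from_native(constructor_lazy({"a": [1, 2], "b": [3, 4]}))
    if constructor_id in ("sqlframe", "pyspark[connect]"):
        # These backends only validate column names when the plan is
        # executed, so force execution inside the assertion.
        with pytest.raises(Exception, match="fdfa"):  # backend-specific type
            df.select(nw.col("fdfa")).collect()
    else:
        with pytest.raises(ColumnNotFoundError):
            df.select(nw.col("fdfa"))
```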
```python
        df.drop(selected_columns, strict=True).collect()
    else:
```
`drop` should already be tested in `drop_test.py`
```python
    msg = (
        r"The following columns were not found: \[.*\]"
        r"\n\nHint: Did you mean one of these columns: \['a', 'b'\]?"
    )
```
We use `parse_columns_to_drop` and therefore raise our `ColumnNotFoundError.from_missing_and_available_column_names(missing_columns=missing_columns, available_columns=cols)` for every backend
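For reference, that shared check boils down to something like this (a simplified sketch; the real helper's signature in Narwhals' internals may differ):

```python
from collections.abc import Sequence

from narwhals.exceptions import ColumnNotFoundError


def parse_columns_to_drop(
    columns: Sequence[str], to_drop: Sequence[str], strict: bool
) -> list[str]:
    if strict:
        missing_columns = [c for c in to_drop if c not in columns]
        if missing_columns:
            raise ColumnNotFoundError.from_missing_and_available_column_names(
                missing_columns=missing_columns, available_columns=columns
            )
    # With strict=False, names that are not present are silently ignored.
    return [c for c in to_drop if c in columns]
```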
I finally had a few minutes to go back to this 🥲 A few notes:
There are a couple of unrelated errors I'll check later.

See #2593 and #2594 (thanks @MarcoGorelli)